Pdf text recognition phase1. basic text import and layout #135

olivetthered · 2020-06-21T07:13:30Z

Basic implementation of importing text from PDF files.
There is no font or styling support yet, and no second pass to group text areas together; this is just basic text area grouping and layout, but it is enough for additional features to be implemented fairly independently of each other. Currently only a single page is supported. To import the text from a PDF file as text, select the "import text as text" option (as opposed to the default, "import text as vector") in the GUI when importing a vector file of PDF type.

Import text from a PDF document, with some fuzzy matching to put lines of text that appear to belong together in the same text frame. Layout is good, but there is no font or styling support as of yet, and rotated text isn't supported either. It creates lots of text boxes if the PDF file reports lots of text regions; these also need joining up in a second pass to merge text regions that should be together, regardless of what the PDF file is reporting.
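To illustrate the kind of fuzzy matching described above, here is a minimal sketch in Python. It is not the code from this pull request: the names `TextLine`, `TextFrame`, `belongs_with`, `group_lines` and the tolerances `x_tol` and `gap_factor` are all assumptions made for illustration. It assumes unrotated text and page coordinates where y grows downward.

```python
from dataclasses import dataclass, field


@dataclass
class TextLine:
    """One line of text as reported by the PDF (hypothetical structure)."""
    x: float       # left edge
    y: float       # baseline position, y grows downward
    height: float  # approximate line height
    text: str


@dataclass
class TextFrame:
    """A group of lines that appear to belong together."""
    lines: list = field(default_factory=list)


def belongs_with(prev, cur, x_tol=2.0, gap_factor=1.5):
    """Fuzzy test: similar left edge, and at most gap_factor line heights below."""
    same_left_edge = abs(cur.x - prev.x) <= x_tol
    step = cur.y - prev.y
    close_below = 0 < step <= gap_factor * prev.height
    return same_left_edge and close_below


def group_lines(lines, x_tol=2.0, gap_factor=1.5):
    """Walk lines top to bottom; start a new frame whenever the fuzzy test fails."""
    frames = []
    for line in sorted(lines, key=lambda l: (l.y, l.x)):
        if frames and belongs_with(frames[-1].lines[-1], line, x_tol, gap_factor):
            frames[-1].lines.append(line)
        else:
            frames.append(TextFrame([line]))
    return frames
```

With these assumed defaults, two lines whose left edges differ by half a point and whose baselines are 14 points apart end up in one frame, while a line 86 points further down starts a new frame. A single-pass walk like this only compares against the most recent frame, which is one reason a second merging pass would still be needed.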

UI for selecting text import as either vectors (default) or as text. There will need to be some more variables for text import so the user can configure how loose or strict the text block matching is, as I doubt that even good guesses will give a one-size-fits-all solution.
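As a rough sketch of how user-configurable strictness could feed into a merging pass over text regions, the Python below bundles the tolerances into one settings object. Everything here is hypothetical: `Region`, `MatchSettings`, `max_gap_x`, `max_gap_y` and `merge_regions` are illustrative names, not options or functions that exist in this pull request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Bounding box of a text region; y grows downward (assumed)."""
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class MatchSettings:
    """Hypothetical user-facing knobs; smaller values mean stricter matching."""
    max_gap_x: float = 4.0
    max_gap_y: float = 6.0


def near(a, b, s):
    """True when the boxes overlap or lie within the allowed gaps of each other."""
    gap_x = max(a.left, b.left) - min(a.right, b.right)  # negative if overlapping
    gap_y = max(a.top, b.top) - min(a.bottom, b.bottom)
    return gap_x <= s.max_gap_x and gap_y <= s.max_gap_y


def union(a, b):
    """Smallest box containing both regions."""
    return Region(min(a.left, b.left), min(a.top, b.top),
                  max(a.right, b.right), max(a.bottom, b.bottom))


def merge_regions(regions, settings=MatchSettings()):
    """Repeatedly merge any two nearby regions until no pair qualifies."""
    merged = list(regions)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if near(merged[i], merged[j], settings):
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

Passing a stricter `MatchSettings` leaves more regions separate, and a looser one merges more of them, which is the trade-off the user would be tuning.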

olivetthered · 2020-06-21T07:13:59Z

Pending file review by ale

olivetthered · 2020-06-21T12:21:35Z

I raised the following bug to have this pull request reviewed and integrated:
https://bugs.scribus.net/view.php?id=16142

Override Type 3 font output, as we don't want to get confused and try to render these fonts as vectors when vector rendering is only partially functional due to the overrides from SlaOutputDev. Hopefully they can be implemented in the same way as addChar, but if that turns out to be infeasible the overrides can be removed and the fonts can be rendered as vectors in the finished implementation.

…variables to make the classes and members uniform across the pdfTextRecognition implementation. Rename all the classes, member variables and functions so they start with PdfText unless it's not appropriate.

olivetthered added 2 commits June 21, 2020 07:16

olivetthered added 12 commits June 21, 2020 14:58

implement text import as a new outputdev

61e7500

Implement text import as a new output device that inherits from SlaOutputDev, making the appropriate private members of SlaOutputDev protected.

make minimal changes from master

1557c17

tidy up so we make minimal changes from master

make minimal changes from master

0adb75a

fixed some space differences with master

change the name of TextOutputDev to PdfTextOutputDev as it's already …

426e7bd

…taken. Change the name of TextOutputDev to PdfTextOutputDev, as the former name is already taken; the PdfTextOutputDev naming also matches the naming of PdfTextRecognition.

moved all of pdftextrecognition into the pdftextrecognition files

e06bd81

Moved the output dev into the pdftextrecognition files, meaning the slaoutput files no longer have any dependencies on pdftextrecognition. This keeps things neat and tidy and all together.

Merge pull request #11 from scribusproject/master

e953aec

sync with upstream master

Merge branch 'master' into pdfTextRecognition-phase1.-basic-text-impo…

d58166d

…rt-and-layout

add some braces in linearTest and fix a couple of typos

7e58ab7

set the correct ycoord so we can support multiple pages

5547e61

fix z-order/grouping

06068e4

fix z-order/grouping. I don't know why I did this in the first place
